Giulio Formenti, Ph.D.
Laboratory of
Neurogenetics of Language
Vertebrate Genome Laboratory
The Rockefeller University
gformenti@rockefeller.edu
Oliver Fedrigo, Ph.D.
Vertebrate Genome Laboratory
The Rockefeller University
ofedrigo@rockefeller.edu
In collaboration with the Vertebrate Genomes Laboratory
March 22nd & 25th - Monday & Thursday
Path on RU HPC to test data: /fakepath/to/data
«A knowledge of sequences could contribute much
to our understanding of living matter»
Frederick Sanger, 1980
To learn more: Giani et al., 2020
Refined partition chromatography:
The two chains are separated and fragmented; the fragments are read individually, and the sequences from each fragment are overlapped to yield a complete sequence.
Rodger Staden, inventor of the first DNA sequencing ‘software’.
In 1982, Sanger uses it to assemble the entire 48,502 bp genome of bacteriophage lambda.
To learn more: Giani et al., 2020
The process of determining the sequence of an organism without an existing reference.
NGS
- Bridge amplification
- Short reads
- High throughput
- High quality
PacBio (TGS)
- Single molecule
- Long reads
- Lower throughput
- Lower quality
High-quality error-free genome assemblies and annotations are necessary as current 1st and 2nd generation genome sequencing approaches generate numerous errors that cause a variety of problems in downstream analyses. Parts of genes are missing, and some are incorrectly assembled, while others are completely missing from the assemblies despite pieces found in the raw sequence reads. (Vertebrate Genomes Project)
An open, community-based effort to generate the first complete assembly of a human genome.
HiFi reads are nearly perfect in homopolymer-compressed space.
AATTCTACTCATAT__AAAAA__TCA__TTTTTT__CA → AATTCTACTCATAT__A__TCA__T__CA
Nurk et al., in preparation
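Homopolymer compression can be sketched in a few lines of Python. This is a minimal illustration, not the code used by any particular assembler; the function name `homopolymer_compress` is made up for this example. Note that the slide highlights only the two long runs, whereas a full compression also collapses the short runs (such as the leading AA and TT).

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases into a single base."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


if __name__ == "__main__":
    # The two runs highlighted on the slide, in isolation.
    print(homopolymer_compress("AAAAATCATTTTTTCA"))  # ATCATCA
```

Two reads that differ only in the length of a homopolymer run become identical after this transformation, which is why errors in run length disappear in compressed space.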
Long-range interactions. Also used to reconstruct the 3D structure of DNA.
Unfortunately, there are approximately $2^{2N}/\sqrt{\pi N}$ potential alignments between 2 sequences of length N
That is, with sequence length N = 100:
$2^{2 \cdot 100}/\sqrt{3.14 \cdot 100} \approx 9.068476 \times 10^{58}$ alignments
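The figure for N = 100 can be checked numerically. The sketch below evaluates the approximation 2^(2N) / sqrt(pi * N); the function name and the optional `pi` parameter are assumptions of this example, the latter added so that the slide's rounded value of 3.14 can be reproduced.

```python
import math


def approx_alignment_count(n: int, pi: float = math.pi) -> float:
    """Approximate number of alignments of two length-n sequences.

    Evaluates 2^(2n) / sqrt(pi * n). For n above roughly 500 the result
    no longer fits in a float and Python raises OverflowError.
    """
    return 2 ** (2 * n) / math.sqrt(pi * n)


if __name__ == "__main__":
    # Using pi = 3.14, as on the slide.
    print(f"{approx_alignment_count(100, pi=3.14):.6e}")
```

Even at this modest length, enumerating every alignment is out of reach, which motivates the dynamic programming algorithms that follow.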
1970: Needleman–Wunsch algorithm
Sequences
GCATGCU
GATTACA
Best alignments
GCATG-CU
G-ATTACA
GCA-TGCU
G-ATTACA
GCAT-GCU
G-ATTACA
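A minimal sketch of Needleman–Wunsch global alignment follows. The slide does not state a scoring scheme, so the values used here (match +1, mismatch -1, gap -1) are an assumption; under that scheme the three alignments listed above each score 0. The traceback returns a single optimal alignment, and which one it returns depends on how ties are broken.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell takes the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, row_a, row_b = needleman_wunsch("GCATGCU", "GATTACA")
    print(best)
    print(row_a)
    print(row_b)
```

The matrix has (N+1) x (M+1) cells, which is where the quadratic cost in time and memory comes from.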
banana → bamana (substitution of “n” with “m”)
bamana → bambna (substitution of “a” with “b”)
bambna → bambina (insertion of “i”).
EDIT DISTANCE = 3
You can calculate the edit distance for all possible alignments and choose the alignment that minimizes it (for longer sequences, find a heuristic). This has been shown to be mathematically equivalent to optimal matching.
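The edit distance in the banana example can be computed without enumerating alignments. Here is a minimal sketch of the standard dynamic programming approach (Levenshtein distance, with unit cost for each substitution, insertion, and deletion); the function name is chosen for this example only.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (s != t)))  # substitution or match
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("banana", "bambina"))  # 3
```

Only two rows of the matrix are kept in memory, so space is linear in the length of the target, while time remains quadratic.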
1981: Smith–Waterman algorithm
Compares segments of all possible lengths and optimizes the similarity measure
The main difference from the Needleman–Wunsch algorithm is that negative scoring matrix cells are set to zero, making positively scoring local alignments visible. The traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment.
It finds the optimal “local” alignment (best local solution)
For slightly divergent sequences
Quadratic complexity in time and space, therefore it often cannot be practically applied to large-scale problems → you need linear solutions
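The two changes described above (cells clamped at zero, traceback from the highest scoring cell until a zero is reached) can be sketched as follows. The scoring values (match +2, mismatch -1, gap -2) and the test sequences are assumptions made for this example; they do not come from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Unlike Needleman-Wunsch, a cell never drops below zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback starts at the highest scoring cell and stops at a zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # The shared segment ACGC is embedded in otherwise unrelated flanks.
    print(smith_waterman("TTTACGCTTT", "GGGACGCGGG"))  # (8, 'ACGC', 'ACGC')
```

The full matrix is still filled in, so this sketch has the same quadratic cost in time and space noted above.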
Alignment-free approaches have been used for:
Short read assemblers:
- SGA: String graph
- ValVel: String graph
- DISCOVAR: DBG
- SOAPdenovo: DBG
- Euler: DBG
- ABySS: DBG
- Velvet: DBG
- SPAdes: DBG
- Edena: OLC
- Ray: Hybrid
- SSAKE: Greedy
- Perga: Greedy
- …
Long read assemblers:
- Hifiasm: OLC
- Canu/HiCanu (ex Celera): OLC
- Peregrine: HGAP/OLC
- Falcon-Unzip: HGAP/OLC
- Flye: Repeat graph
- …
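Many of the short read assemblers listed above are labelled DBG (de Bruijn graph). As a rough illustration of that data structure only, the sketch below builds a de Bruijn graph in which nodes are (k-1)-mers and each k-mer in a read contributes one edge from its prefix to its suffix. It is not how any of the listed tools is implemented: real assemblers add k-mer counting, error correction, reverse complements, and graph simplification, none of which is shown here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    # Two overlapping toy reads; a path through the graph spells ACGTC.
    for node, successors in sorted(de_bruijn_graph(["ACGT", "CGTC"], 3).items()):
        print(node, "->", sorted(successors))
```

Because the graph is built from exact k-mers rather than pairwise read alignments, construction time grows linearly with the total read length.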
GenomeScope (v2.0, 2020) is unrivalled software for assembly-free estimates.
A genome assembly is the entire genomic sequence derived through a de novo (i.e. reference-free) assembly process of the raw sequencing reads and released by the curators of the genome in the database upon publication.
The Primary Assembly constitutes what GenBank curators consider the most up-to-date source for the genomic sequence for this reference.
When available, this is the first result shown.
Picking a random human gene, the Primary Assembly normally refers to the human genome assembly ‘GRCh38’ (Genome Reference Consortium Human Build 38), the latest release in the long list of high-quality assemblies for the human genome generated since 2001.
Differences between two humans:
Human-chimp differences (120 Mb overall):
3x10^7 substitutions → 30 Mb (25%)
5x10^6 indels (<80 bp) → 22 Mb (18%)
7x10^4 SVs (>80 bp) → 68 Mb (57%)